Abstract

Background & Objective: Managing data from large-scale projects (such as The Cancer Genome Atlas (TCGA)) for further
analysis is an important and time-consuming step for research projects. Several efforts, such as the Firehose project, make
TCGA pre-processed data publicly available via web services and data portals, but this information must be managed,
downloaded and prepared for subsequent steps. We have developed an open source and extensible R-based data client for
pre-processed data from Firehose, and demonstrate its use with sample case studies. Results show that our
RTCGAToolbox can facilitate data management for researchers interested in working with TCGA data. The RTCGAToolbox
can also be integrated with other analysis pipelines for further data processing.

Availability and implementation: The RTCGAToolbox is open-source and licensed under the GNU General Public License
Version 2.0. All documentation and source code for RTCGAToolbox is freely available at
http://mksamur.github.io/RTCGAToolbox/ for Linux and Mac OS X operating systems.
Introduction

The explosion of data from high throughput experiments, fueled 
by various functional genomics technologies, is expected to 
overwhelm attempts at analyzing genomics data [1,2]; this trend 
is most evident in oncogenomics, where a vast number of tumors 
have been profiled by individual laboratories. By the end of 2015,
the Cancer Genome Atlas (TCGA) (http://cancergenome.nih.gov) [3]
Research Network plans to achieve the ambitious goal of
analyzing the genomic, epigenomic and gene expression profiles of
more than 10,000 specimens from more than 25 different tumor
types [4]. The massive amount of information emerging
from such large-scale projects is becoming increasingly difficult for
researchers to manage.

In 2013, the TCGA Research Network summarized the aims of
the TCGA project as to generate, quality control, merge, analyze and
interpret molecular profiles at the DNA, RNA, protein and
epigenetic levels for hundreds of clinical tumors representing
various tumor types and their subtypes [4]; the authors also
reported that cases that meet quality assurance specifications are
characterized using technologies that assess the sequence of the
exome, copy number variation, DNA methylation, mRNA
expression and sequence, microRNA expression and transcript
splice variation. Additional platforms applied to a subset of the
tumors, including whole-genome sequencing and reverse phase
protein arrays (RPPAs), provide
additional layers of data to complement the core genomic data sets
and clinical data [4].

Such a deluge of data also creates problems of access and
management for researchers. A key factor in the utility,
sustainability and future use of a novel resource lies in its ability
to allow for data sharing and to be interoperable with major
international cancer research efforts [5]. In addition, Buetow et al.
and Saltz et al. also underscore the importance of interoperable
IT infrastructures that facilitate simpler data access and data
sharing for cancer research [6,7]. To address these challenges, a
number of tools for different genomic data platforms have been 
developed by several groups: these include GEOquery [8], 
BioMart (a simple federated query system based on a generic 
framework designed for biological storage and retrieval) [9,10] and 
web based tools such as an engine to index and annotate the 
TCGA files [11]. 

A limited number of web portals (such as canEvolve [2] and 
cBio [12,13]) are available to access and organize TCGA data for 
further analysis. The Firehose pipeline management system has
been developed by the Broad Institute (http://gdac.broadinstitute.org)
for use in comprehensive, automated and reproducible
analyses of the data generated by TCGA [14]. However, even
though Firehose provides pre-processed data to the research
community, it has several limitations with regard to systematic
access to the data, and many researchers write their own (or
borrow) shell, Perl or Python scripts to download required files to
their local environment [15]. Although the Firehose project provides
the "firehose_get" tool, which is more efficient for pipelines and
analysis tools than downloading data directly from the web, it is
not easily integrated with programming environments for post analysis.

Here we present an open source library for access and
management of TCGA data. RTCGAToolbox allows users to
systematically access Firehose pre-processed data, and to organize
it for easy management and analysis. Currently, Firehose allows
access to more than 7 primary data types for more than 25 cancer
subtypes (Table 1). The library also allows users to create data
matrices from TCGA data, without any pre-processing.
RTCGAToolbox can also access the Firehose analysis pipeline to get
GISTIC2 [16] results for questions related to copy number data.
In addition, the basic analysis functions of RTCGAToolbox support
simple comparisons, analyses and visualization without
having to call external tools. Furthermore, users can apply their
favorite R packages to develop their own pipelines for downstream
analysis with analysis-ready matrices. Several recent publications
[17,18,19] show that systematic access and analysis of TCGA data
provides valuable information about cancer and helps researchers
to improve their studies.

Implementation 

Development of the RTCGAToolbox package was mainly
driven by two major demands: (i) to provide user-friendly and
rapid access to TCGA data processed by Firehose; and (ii) to
provide a programmatic interface for analysis software/pipelines
to access TCGA data systematically. RTCGAToolbox is developed
in the R programming language and provides an expandable
open source environment for future development and integration.
Figure 1A shows a schematic overview of the RTCGAToolbox
and its basic functionalities.

RTCGAToolbox uses the Firehose project, one of the largest
TCGA data sources, operated by the Broad Institute's Genome Data
Analysis Center (GDAC), to access Level 3 (segmented or
interpreted data) and Level 4 (region of interest data)
pre-processed data.

The first level of processing is to access Firehose reports, and 
prepare datasets and type lists in order to organize data. The main 
module of the RTCGAToolbox accesses reports via HTTP calls, 
and uses text processing functionalities to prepare required 
information for subsequent steps. To support further analysis,
the RTCGAToolbox creates a data object by parsing the default
Firehose exports; this functionality can be useful for possible future
integrations with other environments and R resources. The client
function also allows users to access different Firehose archive dates
programmatically. In addition to these data client functionalities,
the current version of RTCGAToolbox facilitates basic analysis
(differential gene expression, mutation frequencies, survival
analysis, and correlation analysis between copy number and gene
expression). In addition, because the output objects are defined as
S4 classes, users can apply their favorite algorithms and packages
for further downstream analysis.

Description 

To use the RTCGAToolbox as a data client for the Firehose 
project, it is necessary to know the run dates for Firehose standard 
data and analyses. The RTCGAToolbox provides functions to list 
the "standard data run", the "analysis run", and names for valid 
dataset aliases. Users must provide valid dates and dataset aliases
(File S1).

One of the primary goals of this project is to allow users to
systematically access and organize TCGA Level 3 and Level 4
data outputs. Through its extensible structure, the RTCGAToolbox
can be integrated with other R libraries, allowing R users to
integrate their own data for further analysis.

In addition to its data management functionalities,
RTCGAToolbox allows users to perform basic analysis: it provides quick
analysis options for deriving useful information from the data, and
can also create circle plots to summarize the data. After the
data-downloading step, RTCGAToolbox deletes the compressed files
it has already used, to free up disk space; users can also use the
stored data matrix files in other environments. Detailed case studies
and user instructions are included in File S1.

RTCGAToolbox Usage and Case Studies 

The current version of RTCGAToolbox can be used as an R
library. Once users get the latest version, the data client and basic
analysis functions can be called via the R interface. Source code and
project are currently accessible through
http://mksamur.github.io/RTCGAToolbox/

RTCGAToolbox Data Client and Analysis Functions. As 
a data client tool and a functional library, RTCGAToolbox 
provides several functions for users to control the management 
process: these can be described as control, client and analysis 
functions. 

The main aim of the control functions is to provide valid dates
and dataset aliases to the users; they are also used by the client
functions to check parameters. The Firehose project regularly
provides one stddata run per month and four analysis runs per
year. To access valid dates, users can call the
"getFirehoseRunningDates" and "getFirehoseAnalyzeDates" functions,
which provide the dates of data runs and analysis runs, respectively.
Dataset aliases are also important for data client functionality, and
the "getFirehoseDatasets" function helps users to get the valid
aliases for the datasets. Table 1 lists information about current
dataset aliases and contents for each dataset.
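
The parameter check described above can be pictured with a short sketch. It is written in Python purely for illustration (the package itself is R); the function name check_request, the YYYYMMDD date strings and the example lists are assumptions made for this example, not values returned by the package.

```python
def check_request(dataset, run_date, valid_datasets, valid_dates):
    """Return a list of problems with a requested dataset alias and run date.

    An empty list means both parameters are valid.
    """
    problems = []
    if dataset not in valid_datasets:
        problems.append(f"unknown dataset alias: {dataset}")
    if run_date not in valid_dates:
        problems.append(f"unknown run date: {run_date}")
    return problems


# Hypothetical lists standing in for what the control functions would return.
valid_datasets = ["BRCA", "OV", "LUAD"]
valid_dates = ["20140316", "20140416"]

print(check_request("BRCA", "20140416", valid_datasets, valid_dates))  # []
print(check_request("BRCX", "20140401", valid_datasets, valid_dates))
```

Checking both parameters before any download is attempted lets a client report all invalid inputs at once instead of failing on the first HTTP request.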

The core function of the library is the client function,
"getFirehoseData", which provides a
data client that checks the valid dates and aliases, gets the URL for
data requested by the user, downloads the data into a working
directory, and prepares the data matrices for downstream analysis.
Calling the function initiates three main sub-processes. First,
the function accesses Firehose services to get the URLs
for the user-specified data types. Second, the client function
downloads the data from the Firehose TCGA data portal. Finally,
the data matrix is prepared; depending on the data type, size and
connection speed, this process may take a shorter or longer time.
By default, users have to specify at least two parameters: "dataset"
and either "runDate", "gistic2_Date", or both. The current version
of the RTCGAToolbox is capable of handling the data types that
are summarized in Table 2.

In addition to its client functionalities, RTCGAToolbox also
provides analysis functions for collecting information from the
datasets. The current version of the package comes with five basic
functions: i) The "getDiffExpressedGenes" function provides the
results of differential gene expression analysis. It uses sample
barcodes to differentiate between "Normal" and "Tumor"
samples, and compares them with the linear models and empirical
Bayes methods provided by the limma [20] package. It also
uses the voom [21] function (from the same package) to prepare raw
RNAseq counts for differential gene expression analysis. ii) Previous
studies show that copy number alterations may affect the
levels of gene expression [22]. Based on the dosage effect
hypothesis, we have integrated the "getCNGECorrelation"
function for calculating correlations between copy number
estimates from GISTIC2 [16] and gene expression levels. iii) All

Figure 1. Overall RTCGAToolbox structure and workflow. (A) Overall representation of RTCGAToolbox layers from Firehose web portal to user
environments. (B) Sample workflow for "BRCA" dataset.
[Panel A layers, from top to bottom: R libraries and external tools; RTCGAToolbox (basic analysis and visualization; query, download and
organize); Firehose pipelines to pre-process data (TCGA raw data pre-processing); TCGA project raw data. Panel B workflow: installing via
GitHub (Supplementary File 2, Code Chunk 0); installing library and checking required info (Code Chunk 1); data client to access TCGA
Firehose data (Code Chunk 2); pre-defined analysis functions: differential gene expression (Code Chunk 3A), correlation between CN and GE
(Code Chunk 3B), mutation frequency (Code Chunk 3C), survival (Code Chunk 3D); reporter figure (Code Chunk 4).]
doi:10.1371/journal.pone.0106397.g001


cancers are known to be caused by somatic mutations; however,
our understanding of the mutational processes that cause these
mutations is remarkably limited [23]. The "getMutationRate"
function was developed to return gene mutation frequencies from
samples that have mutation information. This function is useful if
users want to integrate information about mutation with other
data types. iv) Survival analysis is considered to be one of the
methods that yields clinically valuable information. To allow the
user to gain insight into survival profiles, the RTCGAToolbox has the
"getSurvival" function, which creates sample groups by using
levels of gene expression, compares differences between groups,
and provides Kaplan-Meier (KM) plots as a final product. v) And finally, the



Table 2. Data types supported by RTCGAToolbox.

Data Type (Parameter) | Description | Output Object (FirehoseData)
Clinic | Provides clinical information for each sample. Clinical information may include stage, survival time, sex, age and more. | fd@Clinical (data frame)
RNAseq_Gene and/or RNAseq2_Gene_Norm | Gene level expression data from RNA-seq platforms. This parameter provides raw counts and normalized values. Firehose provides 2 different algorithms for RNAseq data processing. (Data types can be specified by using RNAseqNorm and RNAseq2Norm parameters) | fd@RNASeqGene, fd@RNASeq2GeneNorm (data matrix)
miRNASeq_Gene | miRNA expression levels from next generation sequencing platforms | fd@miRNASeqGene (data matrix)
CNA_SNP | Segmented copy number alterations (in somatic cells) | fd@CNASNP (data frame)
CNV_SNP | Segmented copy number variations (in germline cells) | fd@CNVSNP (data frame)
CNA_Seq | Copy number alterations provided by next generation sequencing platforms | fd@CNAseq (data frame)
CNA_CGH | Copy number alterations provided by CGH array platforms | fd@CNACGH (a list of FirehoseCGHArray objects)
Methylation | Methylation data provided by array platforms | fd@Methylation (a list of FirehoseMethylationArray objects)
Mutation | Gene level mutation information matrix | fd@Mutations (data frame)
mRNA_Array | Gene level expression data provided by array platforms | fd@mRNAArray (a list of FirehosemRNAArray objects)
miRNA_Array | miRNA expression data provided by array platforms | fd@miRNAArray (a list of FirehosemRNAArray objects)
RPPA | Reverse phase protein array (RPPA) expression | fd@RPPAArray (a list of FirehosemRNAArray objects)

doi:10.1371/journal.pone.0106397.t002
Figure 2. Sample heatmap outputs from the BRCA dataset. Panels A and B show the top differentially up- and down-regulated genes between
"Cancer" and "Normal" samples using RNASeq and microarray data, respectively.

doi:10.1371/journal.pone.0106397.g002 




Figure 3. KM plot for PIK3CA gene. A KM plot comparing survival between samples with high and low expression of PIK3CA, the gene with
the highest mutation frequency in the BRCA dataset.
doi:10.1371/journal.pone.0106397.g003
Figure 4. Summary plot for BRCA dataset. A circle plot that shows the differentially expressed genes from the RNASeq and microarray
platforms (inner circles 1 and 2; the y axis represents the fold change value, red dots are up-regulated and blue dots are down-regulated in
cancer samples), copy number changes (third inner circle; blue zones represent deletions and red zones represent amplifications) and, in the
outer circle, the genes that are mutated in at least 5% of samples.
doi:10.1371/journal.pone.0106397.g004



RTCGAToolbox also includes the "getReport" function, which is
a visualization tool that uses differential gene expression results,
copy number data and mutation rates to visualize genome-wide
alterations with RCircos [24].
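
To make the notion of a gene mutation frequency concrete, the sketch below uses one plausible definition: the fraction of samples with mutation information that carry at least one mutation in a given gene. It is written in Python for illustration only; the function name mutation_frequency, the input layout and the sample identifiers are assumptions, and the exact calculation performed by "getMutationRate" is not reproduced here.

```python
from collections import defaultdict


def mutation_frequency(records, samples):
    """Compute per-gene mutation frequencies.

    records: iterable of (sample_id, gene) pairs, one per reported mutation.
    samples: identifiers of all samples that have mutation information.
    Returns {gene: fraction of samples with at least one mutation in gene}.
    """
    sample_set = set(samples)
    mutated = defaultdict(set)
    for sample_id, gene in records:
        if sample_id in sample_set:
            # A set counts each sample once, however many mutations it has.
            mutated[gene].add(sample_id)
    n = len(sample_set)
    return {gene: len(ids) / n for gene, ids in mutated.items()}


# Hypothetical toy data: four samples, PIK3CA mutated twice in sample S1.
samples = ["S1", "S2", "S3", "S4"]
records = [("S1", "PIK3CA"), ("S1", "PIK3CA"), ("S2", "PIK3CA"), ("S3", "TP53")]
freq = mutation_frequency(records, samples)
print(freq)  # {'PIK3CA': 0.5, 'TP53': 0.25}

# Genes mutated in at least 5% of samples, the threshold used in Figure 4.
print(sorted(g for g, f in freq.items() if f >= 0.05))
```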

RTCGAToolbox Case Study. We provide below a case
study to show the current functions of RTCGAToolbox, and to
demonstrate how to integrate its outputs with other R libraries. We
also provide a user guide and step by step sample code in
Figure 1B, File S1 and File S2. For this case study, we analyze
breast invasive carcinoma (BRCA) mRNA, copy number,
mutation and clinical data with the RTCGAToolbox.

i) After installing (Figure 1B, Step 0) the library via
http://mksamur.github.io/RTCGAToolbox/, the RTCGAToolbox can
be called (Figure 1B, Step 1). Note that the library depends on
several other R packages and these libraries must be working
properly (see "Known issues" below).

ii) Users then use one valid dataset alias and a stddata and/or
analysis date to call the data client. Information about additional
data types and structure, with valid parameter names, is listed in
Table 2.

iii) "getFirehoseData" (Figure 1B, Step 2) is the main data client
function to process and prepare analysis matrices. This function
returns an object that stores the requested data in matrices, lists, or
data frames. After successfully requesting and getting data, analysis
functions can be used to get quick results.

iv) The "getDiffExpressedGenes" (Figure 1B, Step 3A) function
accepts an object produced by the "getFirehoseData" function.
The TCGA project produces systematic barcodes for each sample and
the "getDiffExpressedGenes" function uses the same systematic
approach to create "Tumor" and "Normal" sample groups. Users
do not need to define groups separately to perform the analysis.
The TCGA project collects data from multiple platforms, such as
RNAseq and microarray platforms. If the dataset has
mRNA expression data from several platforms, the
"getDiffExpressedGenes" function calculates the differential gene
expression between the groups for each dataset separately, and
returns a list that stores the results for each platform. Figure 2
shows the heatmap outputs of differentially expressed genes. The
"getDiffExpressedGenes" function also provides volcano plots.
Following the analysis, the function can also filter
results by fold change and p values (Figure 1B, Step 3A), to
yield strong differences between groups. Ciriello et al. recently
showed that copy number changes are dominant in several cancer
types [25]. To enable analysis of copy number variations, and to
calculate the correlation between the copy number and the gene's
expression level in paired samples, we added the
"getCNGECorrelation" function (Figure 1B, Step 3B). The function
returns a list object that stores the resulting data frames, which
contain the gene symbol, correlation coefficient and adjusted p value
for each gene. Ciriello et al. also point out that mutations dominate
several cancer types [25]; however, our understanding of the mutational
processes that cause somatic mutations in most cancer classes is
remarkably limited [23]. We thus incorporated the
"getMutationRate" function for calculating the frequency of gene
mutations (Figure 1B, Step 3C). Finally, the RTCGAToolbox uses
univariate survival analysis (Figure 1B, Step 3D) and KM plots to
show differences in survival associated with high and low levels of
expression of individual genes. To run the survival function, users
must provide a data frame that includes a sample barcode, time,
and event data. This frame can be obtained from clinical data,
which can be downloaded by use of the data client function.
Figure 3 shows the KM plot from the output of the survival
function.
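
The barcode-based grouping mentioned in step iv) can be illustrated with a minimal sketch. It relies on the published TCGA barcode convention, in which the fourth hyphen-separated field starts with a two-digit sample type code (01 to 09 for tumor samples, 10 to 19 for normal samples). The Python function sample_group and the example barcodes are illustrative assumptions; the package's own R implementation may differ in detail.

```python
def sample_group(barcode):
    """Classify a TCGA barcode as 'Tumor', 'Normal', 'Other' or 'Unknown'.

    Uses the two-digit sample type code that starts the fourth field,
    e.g. '01' in TCGA-XX-XXXX-01A-...
    """
    fields = barcode.split("-")
    if len(fields) < 4 or not fields[3][:2].isdigit() or len(fields[3]) < 2:
        return "Unknown"  # barcode too short to carry a sample type code
    code = int(fields[3][:2])
    if 1 <= code <= 9:
        return "Tumor"
    if 10 <= code <= 19:
        return "Normal"
    return "Other"  # e.g. control samples (codes 20 to 29)


# Illustrative barcodes in the TCGA format (not taken from the paper).
barcodes = [
    "TCGA-A1-A0SB-01A-11R-A144-07",
    "TCGA-BH-A0BJ-11A-23R-A089-07",
    "TCGA-A1-A0SB",
]
for b in barcodes:
    print(b, sample_group(b))
```

Because the group is derived from the barcode itself, no separate sample annotation table is needed to split an expression matrix into tumor and normal columns.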

v) To provide a visual summary for each dataset, we have
implemented a reporter function that creates a circle plot,
developed for large-scale multi-sample genomic research data
[24]. The "getReport" function (Figure 1B, Step 4) uses data
about the copy number, mutations, and results from differential
gene expression analysis to produce Figure 4, the summary
figure for the BRCA dataset. The outer circle shows the gene
symbols that are mutated in at least 5% of the samples; inner track